Abstract — In the cloud computing environment, resources are 
accessed as services rather than as products. Monitoring such a 
system for performance is crucial because users typically buy 
pay-per-use packages for their jobs. With the huge 
number of machines currently in the cloud system, it is often 
extremely difficult for system administrators to keep track of all 
machines using distributed monitoring programs such as 
Ganglia 1, which lack system health assessment and summarization 
capabilities. To overcome this problem, we propose a technique 
for automated anomaly detection using machine performance 
data in the cloud. Our algorithm is entirely distributed and 
runs locally on each computing machine in the cloud in order 
to rank the machines by how anomalous their behavior is 
for given jobs. There is no need to centralize any of the 
performance data for the analysis, and at the end of the analysis 
our algorithm generates error reports, thereby allowing the 
system administrators to take corrective actions. Experiments 
performed on real data sets collected for different jobs validate 
that our algorithm has a low overhead for tracking 
anomalous machines in a cloud infrastructure. 

I. Introduction 

Cloud Computing [1] refers to the infrastructure in which 
applications are delivered as services over the Internet. 
These infrastructures are supported by very large networked 
distributed machines. Users typically can pay for the time 
they would like to use these resources, e.g. CPU usage per 
hour or storage costs per day. This mode of computation is 
beneficial to both the user and provider for several reasons: 

• with a pay-per-use model, users can run their 
jobs at a lower cost than by owning the 
machines themselves 

• since there is a cost associated with borrowing the 
resources, users always have an incentive to return 
them when they are no longer needed 

• the cloud provider can easily add resources 
when demand can no longer be met 

With the introduction of any new technology, there is always 
a need for developing techniques for health assessment of 
these systems. This is true even in the case of clouds, where 
providers strive for availability and responsiveness since 
user expectations are high. System administrators 
in charge of such systems have a daunting task in 
maintaining them, given the hundreds to thousands of machines 

1 ganglia.sourceforge.net/ 


in the system. When failures occur, the faults may quickly 
propagate, causing widespread damage. Therefore, system 
administrators would like to detect these faults automatically 
and as early as possible so that they can be mitigated early. 

Event monitoring programs such as Ganglia 2 provide a 
web-based visualization interface that allows system 
administrators to view different parameters pertaining to the 
health of each of the machines in the distributed infrastructure. 
A detailed list of the parameters is given in Section V. 
Given that there are hundreds to thousands of machines in the 
system, visual inspection of system performance may come too 
late or be nearly impossible. Moreover, it is also imperative to 
isolate the fault to a small subset of variables (fault isolation). 
An automated fault detection and isolation technique is 
necessary in such scenarios. 

In this paper, we describe an automated fault detection 
framework for cloud systems (FDCS) which runs on top of the 
Ganglia system. The algorithm is entirely decentralized; as 
a result, it does not burden any single machine with excessive 
workload and does not require all the 
data to be centralized for execution. FDCS takes all the 
measurements of Ganglia into consideration and reports a 
list of the machines ranked by their anomaly or fault 
score. Moreover, for each machine in this list, a system 
administrator can display the most faulty variable, i.e. the one which 
caused the anomaly. The algorithm uses a distance-based 
anomaly definition to identify whether a machine is faulty. 
It is extremely fast and can run continuously on changing 
data, thereby allowing uninterrupted monitoring of 
machine performance. Using FDCS, one can take corrective 
actions early, before faults become fatal and 
degrade the overall system performance. 

The rest of the paper is organized as follows. In Section 
II we discuss some previous work related to this area of 
research. Next in Section III we discuss the notations and 
problem definition. In Section IV we present our fault detec- 
tion and isolation (FDCS) framework. Empirical evaluation 
is presented in Section V. Finally, we conclude and discuss 
some future directions in Section VI. 


2 ganglia.sourceforge.net/ 


II. Related Work 

In this section we present some work related to this area 
of research. 

Arshad et al. [2] present a framework for intrusion 
detection and diagnosis for clouds. The goal of the paper 
is to map the input call sequences to one of five 
severity levels: “minimal”, “medium”, “serious”, “critical”, 
and “urgent”. The authors use decision trees for this 
task. The tree learns rules which can make predictions 
on unseen instances. Experiments with publicly available 
system call sequences from the University of New Mexico 
(UNM) show that the algorithm exhibits good performance. 
A similar approach was also developed by Zheng et al. [3] 
and [4]. The last paper uses canonical correlation analysis 
(CCA) for tracking maximally correlated subspaces over 
time. One problem with both these techniques is that they 
need labeled examples for training, which are difficult 
to acquire. 

Most of the existing techniques for failure detection 
are rule-based [5], defining a set of watchdogs. The 
method consists of monitoring a single sensor using 
hard thresholds. Whenever the sensor value crosses the 
threshold, an alarm is raised. However, this threshold needs 
to be changed for different types of jobs to prevent missed 
detections and false alarms. 

Bodik et al. [6] develop a method for identifying time 
cycles in machine performance which fall below a certain 
threshold. They use quantiles of the measured data to 
statistically quantify faults. They optimize the false positive 
rate and allow the user to control it directly. This method 
was evaluated on a real datacenter running enterprise-level 
services, giving around 80% detection accuracy. However, 
as with some of the previous techniques, this method too 
requires labeled examples. An overview article on this topic 
is available in [7]. 

Pelleg et al. [8] explore failure detection in virtual machines. 
They use decision trees to monitor counters of the 
system. First of all, this method requires labeled instances 
for training. Moreover, the counters which are monitored 
are selected manually, which reduces the scope of its 
general applicability. It is only suitable for well managed 
settings that include predictable workloads and previously 
seen failures. 

Some data mining techniques have also been applied to 
monitoring distributed systems, e.g. the Grid Monitoring 
System (GMS) by Palatin et al. [9] and the fast outlier 
detection by Bhaduri et al. [10]. GMS uses a distributed 
distance-based outlier detection algorithm, which detects 
outliers using the average distance to the k nearest neighbors. 
Similar to our method, GMS is based on outlier detection, 
is unsupervised and requires no domain knowledge. However, 
the detection rate of GMS can be very slow due to the 
quadratic time complexity of the k-nn computation. The authors 
in [10] propose to speed up this computation using fast 
database indexing and distributed computation. 

Gabel et al. [11] present a technique for latent fault 
detection on clouds. The proposed framework is unsupervised 
and based on statistical tests for fault detection. The main 
idea behind it is to compare machines performing the same 
task at the same time. A machine is flagged as abnormal 
when it deviates from the normal behavior. The authors 
demonstrate three tests within this framework and provide 
theoretical guarantees on the false detection rates of the 
proposed tests. The experiments are performed on several 
production services of various sizes and natures, including 
ones using virtual machines. However, this method is not 
distributed, thereby requiring one machine to run the tests. 

III. Notations and Problem Definition 

In this section we present some notations which are 
necessary for discussing our FDCS framework. 

Let $P_1, \ldots, P_p$ be $p$ machines in the cloud infrastructure 
connected to each other via a communication infrastructure 
such that the set of (one-hop) neighbors of $P_i$, $\Gamma_i$, is known 
to $P_i$. Each $P_i$ holds a dataset $D_i$ (e.g. its status or log file) 
containing $n$ vectors, each in $\mathbb{R}^d$. We assume 

• Disjoint property: $D_i \cap D_j = \emptyset$, $\forall i \neq j$ 

• Global property: $D = \bigcup_{i=1}^{p} D_i$ 

In real applications, it is not feasible to compute D due to 
massive data sizes, changing datasets or both. In this paper, 
we have only introduced this notation to formally define our 
global fault detection task via distributed processing. 

Given two user-defined parameters $t, k > 0$, let $N_k(x, D)$ 
denote the set of $k$ nearest neighbors from $D \setminus \{x\}$ 
to $x$ (with respect to Euclidean distance, with ties broken 
according to some arbitrary but fixed total ordering $\prec$). 
Let $\delta_k(x, D)$ denote the maximum distance between $x$ and 
all the points in $N_k(x, D)$, i.e. the distance between $x$ and 
its $k$-th nearest neighbor in $D$. $\delta_k(x, D)$ can be viewed as 
an outlier ranking function. Let $O_{t,k}(D)$ denote the top 
$t$ points (outliers) in $D$ according to $\delta_k(\cdot, D)$. In the rest 
of the description, for simplicity, we rewrite $N_k(b, D)$, 
$\delta_k(b, D)$ and $O_{t,k}(D)$ as $N_k(b)$, $\delta_k(b)$ and $O_k$. 

Definition 3.1 (Distributed fault detection): Given 
integers $t, k > 0$, and dataset $D_i$ at each machine $P_i$, the 
goal of the distributed fault detection algorithm is to compute 
the outliers $O_k$ (in $D = \bigcup_i D_i$). 

In the above definition, we have assumed that the distributed 
outlier detection algorithm produces the same set of 
outliers as its centralized counterpart [12]. The distributed 
algorithm that we discuss in this paper guarantees global 
correctness. 
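
As a point of reference, the centralized definition above can be sketched in a few lines of Python. This is a brute-force illustration only, not the FDCS implementation; the function names and the toy points are assumptions made for this sketch, and ties are broken by index as one possible fixed total ordering.

```python
import math

def knn_distance_score(i, data, k):
    """delta_k(x, D): distance from data[i] to its k-th nearest neighbor in D \\ {x}."""
    dists = sorted(math.dist(data[i], y) for j, y in enumerate(data) if j != i)
    return dists[k - 1]

def top_outliers(data, t, k):
    """O_{t,k}(D): indices of the t points with the largest delta_k score."""
    scored = [(knn_distance_score(i, data, k), i) for i in range(len(data))]
    # Largest score first; ties broken by index (a fixed total ordering).
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:t]]

# Toy data: four points near the origin and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
outliers = top_outliers(points, t=1, k=2)
```

The quadratic cost of this baseline (every point is compared with every other point) is what the pruning and indexing described later are meant to avoid.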


IV. Fault Detection in Cloud Systems (FDCS) 

In this section we describe our Fault Detection in Cloud 
Systems (FDCS) framework in which the participating ma- 
chines in a cloud computing environment can collaboratively 
track the performance of other machines in the system 
and raise an alarm in case of faults. Our algorithm relies 
on in-network processing of messages, thereby making it 
faster than the brute force alternative approach of data 
centralization. Moreover, as we discuss in this section, it 
also allows fault isolation (determining which features are 
most faulty), which is valuable for taking remedial actions. 

In our distributed setup, we assume that there is a central 
machine in the cloud infrastructure, called the reporter, which 
does the final reporting of all the outliers. We also assume 
that all computational entities $P_1, \ldots, P_p$ form a 
unidirectional communication ring (except the leader machine 
$P_0$), i.e. any machine $P_i$ can communicate with the machine 
with the next higher id, $P_{i+1}$, $1 \le i < p$. Furthermore, each 
machine holds its own data partition $D_i$, while the test points 
are either sent by $P_0$ or read from the disk. 

At any point of time, $P_0$ maintains a current list of $t$ 
outliers $O_k$ found so far. These are the points which, by 
definition, currently have the highest anomaly scores $\delta_k(\cdot, D)$ 
on the global dataset $D$. When the algorithm starts, $O_k$ 
is empty, and it is updated as new candidate outliers 
are received from $P_1, \ldots, P_p$. Another quantity which the 
reporter needs to maintain is the cutoff threshold $c$, which is 
initially set to $-\infty$ and monotonically increases in value 
as more and more outliers are found. Whenever $O_k$ changes, 
$c$ is set to the smallest score in $O_k$ and then broadcast to all 
the other machines in the cloud for more efficient pruning 
of candidate points. 

In FDCS, each worker has two modes of operation: push 
and pull. Alg. 1 gives the pseudo code for the push mode. 
The goal of the push mode is to test a block of data read from 
the memory, populate the $k$-nn of each point based on the local dataset, 
prune the points whose score is below the current threshold $c$, 
and then send the remaining test points to the next 
machine in the ring. The details of this step are as follows. 
Machine $P_i$ maintains a threshold $c_i$ that it has received from 
the reporter $P_0$. Initially $c_i = -\infty$. For each point $b$ in the 
test data block $B$, machine $P_i$ also maintains: 

• $C_k(b)$: the $k$-nearest neighbors found thus far for $b$ 

• $r_b = \max\{\|b - y\| : y \in C_k(b)\}$ 

Initially, $C_k(b) \leftarrow \emptyset$ and $r_b = 0$ for each point $b \in B$. 
The algorithm populates $C_k(b)$ for $b$ and checks whether 
the current score of $b$ is below $c_i$, i.e. whether $r_b < c_i$. If this is 
true, the point is pruned and no longer tested; otherwise 
$b$, along with its nearest neighbors found so far $C_k(b)$ and 
$r_b$, is forwarded (pushed) to the next machine $P_{i+1}$ for 
validation. 
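
The push step described above can be illustrated with the following single-machine Python sketch. It is a simplification under stated assumptions, not the authors' C/C++/MPI code: the name `push_block` is hypothetical, a point is never counted as its own neighbor, and a point is only compared against the cutoff once $k$ local neighbors are known.

```python
import math

def push_block(block, local_data, k, cutoff):
    """One push step on a machine: returns (survivors, pruned_count).

    block      : list of test points (tuples)
    local_data : this machine's partition D_i
    cutoff     : current threshold c_i received from the reporter
    """
    survivors, pruned = [], 0
    for b in block:
        neighbours = []        # C_k(b): up to k closest local points as (dist, point)
        is_pruned = False
        for x in local_data:
            if x == b:
                continue       # assumption: a point is not its own neighbor
            d = math.dist(b, x)
            if len(neighbours) < k or d < neighbours[-1][0]:
                neighbours.append((d, x))
                neighbours.sort()
                del neighbours[k:]   # drop the farthest when more than k are held
            # assumption: r_b is compared with the cutoff only once k neighbors exist
            if len(neighbours) == k and neighbours[-1][0] < cutoff:
                is_pruned = True
                break
        if is_pruned:
            pruned += 1
        else:
            r_b = neighbours[-1][0] if len(neighbours) == k else math.inf
            survivors.append((b, neighbours, r_b))
    return survivors, pruned

local = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
survivors, pruned = push_block([(0, 0), (10, 10)], local, k=2, cutoff=5.0)
```

Pruning on local data alone is safe because $r_b$ can only shrink as more machines contribute neighbors, so a point whose local $r_b$ is already below the cutoff cannot end up among the top outliers.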

In the second phase of FDCS, which is the pull phase, 
the goal of the algorithm is to check the received buffer for 


Algorithm 1: FDCS push mode at any machine $P_i$. 

Procedure PUSH_Anom() 
begin 
    for all blocks of data in $D_i$ do 
        $B \leftarrow$ getNextBlock($D_i$); 
        for all points $b \in B$ do 
            $C_k(b) \leftarrow \emptyset$; 
        for all points $x \in D_i$ do 
            for $b \in B$, $b \neq x$ do 
                if dist$(b, x) < r_b$ or $|C_k(b)| < k$ then 
                    Update $C_k(b)$ with $x$ by removing the farthest point; 
                    Recompute $r_b$; 
                    if $r_b < c_i$ then 
                        remove $b$ from $B$; 
                        $\tau_i \leftarrow \tau_i + 1$; 
        for $b \in B$ do 
            Send $(b, C_k(b), r_b)$ to machine $P_{(i+1) \bmod p}$; 
        Call PULL_Anom(); 


Algorithm 2: FDCS pull mode at any machine $P_i$. 

Procedure PULL_Anom() 
begin 
    for all $x \in$ received buffer do 
        Extract $(x, \mathcal{NN}_k(x), r_x)$ from received buffer; 
        Update $C_k(x)$ using $N_k(x)$ and $\mathcal{NN}_k(x)$; 
        Update $r_x$; 
        if $r_x > c_i$ then 
            if $x$ originated in machine $P_i$ then 
                Send $(x, r_x, C_k(x))$ (a potential outlier message) to the reporter machine ($P_0$); 
            else Send $(x, C_k(x), r_x)$ to machine $P_{(i+1) \bmod p}$; 
        else $\tau_i \leftarrow \tau_i + 1$; 


messages, extract the anomalies and their nearest neighbors, 
and merge the nearest neighbors with the existing ones. 
The pseudo code is shown in Alg. 2. For every point $x$ 
in the received buffer, $P_i$ finds the nearest neighbors from 
$\mathcal{NN}_k(x)$ (which is the best set of $k$ neighbors found so far) 
and $D_i$. The neighbor list and the value of $r_x$ are updated 
accordingly. If, as a result, $r_x$ becomes less than $c_i$, then 
$x$ is pruned. Otherwise, if $x$ originated in $P_i$ itself, it has 
survived the pruning of all the machines and is sent to the 
leader machine $P_0$ (since it can be a potential outlier data 
point). If $x$ did not originate on $P_i$, it is forwarded to $P_{i+1}$ 
Figure 1. Execution of distributed algorithm. The leftmost picture shows the setup: the test points are color coded to show which block is assigned to 
which machine. Second picture shows that assignment. Third figure shows how the non-pruned points are tested at the other machines. 


with the updated nearest neighbors. Machine $P_i$ then goes 
back to the push mode and begins testing the new set of 
points. At any step of the execution, if a machine receives 
a new cutoff threshold $c$, it immediately sets $c_i \leftarrow c$ and 
resumes processing. 
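
A minimal Python sketch of the pull-side decision for a single forwarded point is given below. The function name `pull_point`, the message layout and the returned action strings are assumptions made for illustration; exact duplicates of a neighbor are merged by the set union, which is also an assumption of this sketch.

```python
import math

def pull_point(x, carried, local_data, k, cutoff, origin, me):
    """Process one forwarded message on machine `me`.

    carried : list of (distance, point) pairs, the best k neighbors found so far
    origin  : id of the machine whose block contained x
    Returns (action, merged_neighbours, r_x) with action in
    {"prune", "report", "forward"}.
    """
    local = [(math.dist(x, y), y) for y in local_data if y != x]
    # Merge the carried list with the local candidates and keep the best k.
    merged = sorted(set(carried) | set(local))[:k]
    r_x = merged[-1][0] if len(merged) == k else math.inf
    if r_x < cutoff:
        return "prune", merged, r_x
    if origin == me:
        return "report", merged, r_x    # survived every machine on the ring
    return "forward", merged, r_x

x = (10, 10)
carried = [(math.dist(x, (0, 1)), (0, 1)), (math.dist(x, (1, 0)), (1, 0))]
action, merged, r_x = pull_point(x, carried, [(1, 1), (2, 2)], k=2,
                                 cutoff=5.0, origin=0, me=1)
```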

Alg. 3 shows the tasks executed by the leader machine in 
FDCS. It initializes the outlier list $O_k$ to empty. Whenever it 
receives a new potential outlier $x$, it does one of the following: 

• If $O_k$ contains fewer than $t$ outliers, $x$ is added to $O_k$. 

• If $O_k$ contains $t$ outliers and the score of $x$ exceeds the smallest 
score in $O_k$, the outlier with the smallest score is replaced by $x$. 

If, due to either of these updates, the outlier list 
is full, the cutoff is set to the smallest score in the list 
and is then broadcast to all the machines. 
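
The bookkeeping at the leader can be sketched with a min-heap, as below. This is one possible realization rather than the authors' implementation; the class name `Reporter` and its method names are hypothetical.

```python
import heapq
import math

class Reporter:
    """Keeps the t highest-scoring candidates and the cutoff c."""

    def __init__(self, t):
        self.t = t
        self.heap = []            # min-heap of (score, point)
        self.cutoff = -math.inf   # initial value of c
        self.received = 0         # rho: number of candidates received

    def receive(self, point, score):
        """Return the new cutoff if it changed (to be broadcast), else None."""
        self.received += 1
        if len(self.heap) < self.t:
            heapq.heappush(self.heap, (score, point))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, point))
        else:
            return None           # candidate does not enter the list
        if len(self.heap) == self.t and self.heap[0][0] != self.cutoff:
            self.cutoff = self.heap[0][0]
            return self.cutoff
        return None

reporter = Reporter(t=2)
updates = [reporter.receive(p, s)
           for p, s in [("a", 1.0), ("b", 3.0), ("c", 2.0), ("d", 0.5)]]
```

With a heap, the smallest score in the list is always available at the root, so both the replacement test and the cutoff update take logarithmic time in $t$.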

Fig. 1 shows a snapshot of the distributed algorithm. The 
leftmost figure shows how 3 machines are connected in a 
ring. The test points are shown in the middle, color coded to 
show that each block is assigned to one machine. The second 
figure shows its initial assignment. As the test blocks move 
in the ring, each machine prunes points as nearest neighbors 
are found. As a result, the size of the test blocks shrinks. 
This is shown in the last figure. 

One critical component of any distributed algorithm is the 
termination criterion. In FDCS this can be implemented in 
one of two ways. Each machine $P_i$ keeps track of $\tau_i$, the total 
number of points that it has pruned, and the leader machine 
keeps track of $\rho$, the total number of points it has received as 
potential outliers. Periodically the leader polls the workers 
for their values of $\tau_i$. Whenever $\sum_{i=1}^{p} \tau_i + \rho = |D|$, 
the leader sends a terminate message to all the machines. 
Alternatively, each machine can send a termination signal 
to the leader when its remaining test block size becomes 
zero. 
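
The first termination test amounts to a simple counting check, sketched below in Python. The function name and the example counts are made up for illustration.

```python
def should_terminate(pruned_counts, received_by_leader, total_points):
    """True once every point has been either pruned or reported to the leader."""
    return sum(pruned_counts) + received_by_leader == total_points

# Hypothetical counts for three workers and a dataset of 100 points.
done = should_terminate([40, 35, 20], received_by_leader=5, total_points=100)
not_done = should_terminate([40, 35, 19], received_by_leader=5, total_points=100)
```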

A. Fault Isolation 

In FDCS, it is fairly easy to isolate the attribute or feature 
which caused the outlier score to be high. Let $x_t$ be the 
entity with the highest anomaly score (i.e. $\delta_k(x_t, D)$) and 
$y_1, y_2, \ldots, y_k$ be its $k$-nearest neighbors. Then, the anomaly 


Algorithm 3: FDCS at master machine 
Output: $O_k$, the set of outliers 
Initialization: $O_k \leftarrow \emptyset$; 

if $(x, r_x, C_k(x))$ is received then 
    $\rho \leftarrow \rho + 1$; 
    if $|O_k| < t - 1$ then 
        Add $x$ to $O_k$; 
    if $|O_k| = t - 1$ then 
        $c \leftarrow \min\{\delta_k(y, D) : y \in O_k\}$; 
        Broadcast $c$ to all machines; 
    if $|O_k| \ge t$ then 
        if $r_x > \min\{\delta_k(y, D) : y \in O_k\}$ then 
            Drop $y \in O_k$ with minimum $\delta_k$; 
            Add $x$ to $O_k$; 
            $c \leftarrow \min\{\delta_k(y, D) : y \in O_k\}$; 
            Broadcast $c$ to all machines; 


score is: 

$$\delta_k(x, D) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{dist}(x, y_i)$$ 

where $\mathrm{dist}(x, y_i)$ is the squared Euclidean distance between 
$x$ and $y_i$: 

$$\mathrm{dist}(x, y_i) = \sum_{j=1}^{d} (x_j - y_{i,j})^2$$ 

Here $x_j$ and $y_{i,j}$ denote the $j$-th components of $x$ and $y_i$. 
This shows that the overall score can be decomposed 
into its individual components, and the contribution of 
the $j$-th ($j = 1, \ldots, d$) variable towards the outlier score is: 

$$\frac{1}{k} \sum_{i=1}^{k} (x_j - y_{i,j})^2$$ 

This is the quantity that we have used in our experiments 
as the contribution of the $j$-th feature towards the overall 
score. 
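
The decomposition can be checked numerically with a short Python sketch. The function name and the two-feature toy values are assumptions chosen only to make the arithmetic easy to follow.

```python
def feature_contributions(x, neighbours):
    """Per-feature share of the averaged squared-distance score."""
    k = len(neighbours)
    return [sum((x[j] - y[j]) ** 2 for y in neighbours) / k
            for j in range(len(x))]

x = (1.0, 10.0)
neighbours = [(1.0, 2.0), (2.0, 4.0)]
contrib = feature_contributions(x, neighbours)

# The contributions add up to the overall score (mean squared distance).
score = sum(sum((a - b) ** 2 for a, b in zip(x, y)) for y in neighbours) / len(neighbours)
most_faulty_feature = contrib.index(max(contrib))
```

In this toy case the second feature accounts for almost all of the score, so it would be the one reported as most faulty.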




B. Efficient Preprocessing for Faster Computation 

It has been shown earlier in [10], [13] that distance-based 
algorithms suffer from computational overhead due to their 
potential quadratic time complexity. To overcome this, Bhaduri 
et al. [10] proposed a novel technique for reordering the data. 
In the main technique, the test points are ordered according 
to their distance to a fixed (randomly chosen) point in space, 
with the farthest point being tested first. Moreover, when 
searching for the k-nn of a single point, the data is processed 
in a spiral fashion, as shown in Fig. 2. They have shown 
that this search strategy exploits spatial locality better and 
therefore achieves shorter running times. Also, by ordering the test 
points from largest to smallest distance to a fixed point, it is 
intuitive that the cutoff may increase faster, resulting in 
better pruning. We have used this index at each machine of 
our distributed algorithm to execute the local computation 
faster. 
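
The test-point ordering alone (not the spiral neighbor search) can be sketched as follows. The function name is hypothetical and the reference point is fixed by hand here, whereas the technique described above chooses it at random.

```python
import math

def order_test_points(points, reference):
    """Sort test points by decreasing distance to a fixed reference point."""
    return sorted(points, key=lambda p: math.dist(p, reference), reverse=True)

ordered = order_test_points([(1, 1), (5, 5), (2, 0)], reference=(0, 0))
```

Points far from the reference are more likely to be outliers, so testing them first tends to raise the cutoff early and lets later points be pruned sooner.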



Figure 2. Description of the index. Left figure shows a dataset with normal 
points in blue, outliers in red and the reference and test point. The right 
figure shows the order in which the test points are processed with the points 
farthest from the reference point being processed first. 


V. Experiments 

In this section we describe an empirical evaluation of our 
FDCS algorithm. 

A. Infrastructure Description 

The FDCS algorithm is implemented in C/C++ using MPI 
for message passing. We have run all our 
experiments in a cluster infrastructure at NASA containing 
128 nodes with 16 machines each having two quad-core 
Intel Xeon 2.66 GHz processors and 8 GB of memory, 
running Red Hat Linux. Cluster jobs are managed by the 
open source Torque PBS scheduler. All machines have an 
NFS-mounted RAID array for data storage from a central 
machine connected through Gigabit Ethernet. 

B. Data Description 

The data for this experiment were collected 
using the cluster performance parameters recorded by the 
Ganglia monitoring system, version 3.0.7. A total 
of 30 parameters are measured, covering different performance 
aspects of the cloud (a cluster in this case) such as 
CPU usage, RAM usage, disk access, secondary memory 
access, job submission time, job completion time, boot time 
of the machine and so on. The parameter list is shown in 
Table I. The system monitors the performance parameters 
every 15 seconds and logs the average for each 6 minute 
interval. The files are exported daily at this resolution in 
comma-separated format. 
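
For concreteness, sampling every 15 seconds and logging 6 minute averages means 24 samples per logged value. The Python sketch below illustrates this aggregation; it is not Ganglia's own code, the function name is hypothetical, and dropping an incomplete trailing interval is an assumption.

```python
def six_minute_averages(samples, sample_period_s=15, interval_s=360):
    """Average consecutive samples into fixed-length intervals."""
    per_interval = interval_s // sample_period_s   # 24 samples per 6 minute interval
    return [
        sum(samples[i:i + per_interval]) / per_interval
        for i in range(0, len(samples) - per_interval + 1, per_interval)
    ]

# Two full intervals followed by five leftover samples (dropped).
samples = [1.0] * 24 + [3.0] * 24 + [9.0] * 5
averages = six_minute_averages(samples)
```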

C. Experimental Setup 

For our experiments, we monitored the cluster perfor- 
mance in a controlled environment by submitting a fixed 
set of 64 jobs to run on 8 machines for 3 days. Our job 
consists of reading 200 MB of numerical data, followed by a 
kernel and SVD computation, and finally writing the solution 
to disk files. The code, written in MATLAB, is shown in 
Figure 3. The FDCS algorithm in our experiment uses the 
last 27 parameters. 

D. Results 

The cluster performance data for each of the 8 machines 
concatenated as 6 minute composites for 3 days is stored 
locally at each machine and we run the distributed FDCS 
algorithm on this data to identify the top 50 global out- 
liers. The outliers identified are unique <machine-id-time 
interval> tuples in this data set. The report generated by 
FDCS identifies the most frequent machine id in this list of 
top 50 outliers and returns that machine as the highest ranked 
faulty machine for the given job and time period. Figure 
4 shows a possible report generated as the output of the 
FDCS algorithm. The report lists the top k anomalies (k is user 
specified) from the entire data set. For each of the 
anomalies, the algorithm computes the anomaly scores and 
also the respective weights associated with the parameters 
responsible for the anomalous behavior. The histogram on 
the right of Figure 4 shows the counts of the most anomalous 
machines in the top k list. The most frequently occurring 
machine id in the top k list is designated as the most faulty 
machine in the list. 
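
The ranking step of the report can be sketched in Python using the ten entries of the sample report in Figure 4. The function name is hypothetical; only the counting logic is illustrated.

```python
from collections import Counter

def rank_machines(top_outliers):
    """Count how often each machine id occurs among the top-k outliers."""
    counts = Counter(machine for machine, _ in top_outliers)
    return counts.most_common()   # most frequent machine first

# (machine id, time interval) tuples as listed in the sample report.
report = [
    ("M-5", "08152011_2016"), ("M-8", "08182011_0118"),
    ("M-8", "08172011_2312"), ("M-8", "08172011_2318"),
    ("M-1", "08162011_1516"), ("M-8", "08172011_2330"),
    ("M-1", "08162011_1522"), ("M-8", "08172011_0106"),
    ("M-4", "08162011_0702"), ("M-4", "08152011_0012"),
]
ranking = rank_machines(report)
most_faulty = ranking[0][0]
```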

The FDCS algorithm can not only identify the most faulty 
machine for a job, but also can isolate the cause of the 
fault by indicating the parameter which behaves in the most 
erratic fashion compared to the others. In our analysis, 
the cluster shows no signs of anomaly and, therefore, we 
have artificially injected faults for demonstration purposes. 
We have made the free swap space of machine number 8 
decrease by 80% for 10 consecutive intervals towards the 
tail end of the job and then run the FDCS algorithm on this 
data set. We see that the algorithm reports machine 8 as 
the most anomalous machine and the free swap space and 
the processor load as the two most anomalous features in the 
data set. Figure 5 shows the plot of these two features for the 


Category       Parameters 

job schedule   date, time, boottime 
network        bytes_in, bytes_out, pkts_in, pkts_out 
processor      cpu_aidle, cpu_idle, cpu_nice, cpu_num, cpu_speed, cpu_system, cpu_user, cpu_wio 
process        load_fifteen, load_five, load_one, proc_run, proc_total 
main memory    mem_buffers, mem_cached, mem_free, mem_shared, mem_total, part_max_used 
storage        disk_free, disk_total, swap_free, swap_total 


Table I 

List of performance parameters obtained using Ganglia 


tic 

SRCDir=instruct.SRCDir; 
filelist=dir([SRCDir,'*.mat']); 
filelist={filelist.name}; 

K=zeros(length(filelist)); 

%% Read Files and build sub kernel %% 
for i=1:length(filelist) 
    Flight1=load([SRCDir,filelist{i}]); 
    for j=1:length(filelist) 
        Flight2=load([SRCDir,filelist{j}]); 
        K(i,j)=mean(mean(exp(Flight1.Flight.data(:,1:10))))+mean(mean(exp(Flight2.Flight.data(:,1:10)))); 
    end 
end 

Results.Runtime.ReadBuildSubKernel=toc; 

%% Build full kernel%% 
Kfull=zeros(length(filelist)*6); 
count=1; 
for i=1:6 
    for j=1:6 
        if (mod(count,2)) 
            Kfull((i-1)*length(filelist)+1:i*length(filelist),(j-1)*length(filelist)+1:j*length(filelist))=K; 
        else 
            Kfull((i-1)*length(filelist)+1:i*length(filelist),(j-1)*length(filelist)+1:j*length(filelist))=inv(K); 
        end 
    end 
end 

%% Solve for SVD%% 
[U,S,V]=svd(Kfull); 

Results.Runtime.BuildBigKSolveSVD=toc-Results.Runtime.ReadBuildSubKernel; 

%% Write out SVD%% 
csvwrite(['/data2/bmatthew/IDU_Cluster_Test/U',num2str(instruct.pNum),'.csv'],U); 
csvwrite(['/data2/bmatthew/IDU_Cluster_Test/S',num2str(instruct.pNum),'.csv'],S); 
csvwrite(['/data2/bmatthew/IDU_Cluster_Test/V',num2str(instruct.pNum),'.csv'],S); 

Results.Runtime.WriteOutResults=toc-Results.Runtime.BuildBigKSolveSVD; 


Figure 3. Matlab code for fictitious job used to measure cluster performance 


entire job span. The red curve represents the time series for 
machine 8, while the blue curve represents the most normal 
time series for the same feature. We call the machine with 
the lowest frequency of occurrence in the top k list the 
most normal machine. 
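
The kind of fault injection described above (an 80% drop over 10 consecutive intervals) can be sketched as follows. The function name, the starting index and the synthetic series are assumptions; the exact position of the injected window in the experiment is not specified beyond being near the tail end of the job.

```python
def inject_drop(series, start, length=10, fraction=0.8):
    """Return a copy of `series` with `length` consecutive values reduced by `fraction`."""
    faulty = list(series)
    for i in range(start, min(start + length, len(faulty))):
        faulty[i] = faulty[i] * (1.0 - fraction)
    return faulty

# Synthetic swap_free series of 30 intervals; the fault starts at interval 18.
swap_free = [100.0] * 30
faulty_swap_free = inject_drop(swap_free, start=18)
```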

In another scenario, we have run our experiment in a 
regular cluster environment with multiple other jobs running 
simultaneously. For this data, we identify the features that 
occur the maximum number of times in the top k anomaly 
list. The processor load and memory cache appear to be 
the two most frequent parameters identified to be most 
anomalous. Figures 6 and 7 show the time series of both 
of these features for the entire 26 hour period over which we 
monitored the cluster system. This experiment validates 
that the anomalies identified by the centralized and the 
distributed algorithm are identical. Most of the anomalies 
for the cache memory fall outside our submitted job's 
execution, indicating that the other job(s) running must have 
been extremely memory intensive. On the other hand, quite 
a few anomalies in the processor load variable occur during 
the execution of our submitted job, indicating that our 
job is computation intensive and adds to the processor 
load. 

VI. Conclusion 

Given a cloud infrastructure with hundreds to thousands 
of machines, it is always a challenge for the system admin- 
istrators to monitor the health of the machines. Monitoring 
programs such as Ganglia only allow the administrators 
to visualize the performance of all the machines using a 
web based GUI. As the scale of the system increases, it 



(a) CPU load vs. time 

(b) Free swap space vs. time [x-axis: Time (Hours); curves: Anomalous Node (Node 8), Normal Node (Node 1)] 


Figure 5. Time series plots of two most anomalous features for the most faulty machine identified by the FDCS algorithm 


Fault Report 

Cluster Name: IDU                Cluster Job ID: XXXXX 
Submission date: 08.15.2011      Finish date: 08.18.2011 
Start time: 09:08:22             Finish time: 08:48:16 

Top k (=10) Faults Identified 

1. <M-5 08152011_2016> 
2. <M-8 08182011_0118> 
3. <M-8 08172011_2312> 
4. <M-8 08172011_2318> 
5. <M-1 08162011_1516> 
6. <M-8 08172011_2330> 
7. <M-1 08162011_1522> 
8. <M-8 08172011_0106> 
9. <M-4 08162011_0702> 
10. <M-4 08152011_0012> 

Most Faulty Machine: M-8 

List of fault parameters: swap_free, cpu_system 


Figure 4. Sample report generated for identifying the top k outliers in the 
performance data. The report highlights the most faulty machine from the 
top k list. 
Figure 6. Time series plot of CPU load for the entire monitoring duration 
with anomaly time points highlighted for Machine 8 

Figure 7. Time series plot of cache memory for the entire monitoring 
duration with anomaly time points highlighted for Machine 8 


is imperative to develop automated methods to detect the 
faulty machines and isolate the causes before these faults 
have cascading effects on the entire system. By replacing 
the human in the loop by an automated fault detection tech- 
nique, the response time decreases dramatically. Our FDCS 
framework achieves this goal by deploying a distributed 
outlier detection algorithm that does not require data to be 
centralized, allowing extremely fast detection. FDCS has a 
reporting system which returns the top few faulty machines 
along with the reasons as to why they are faulty. 

As part of future work, we plan to deploy this system to 
large production systems to test the performance of FDCS. 